Semi-supervised G2p bootstrapping and its application to ASR for a very under-resourced language: Iban

نویسندگان

  • Sarah Samson Juan
  • Laurent Besacier
  • Solange Rossato
چکیده

This paper describes our experiments and results on using a local dominant language in Malaysia (Malay), to bootstrap automatic speech recognition (ASR) for a very under-resourced language: Iban (also spoken in Malaysia on the Borneo Island part). Resources in Iban for building a speech recognition were nonexistent. For this, we tried to take advantage of a language from the same family with several similarities. First, to deal with the pronunciation dictionary, we proposed a bootstrapping strategy to develop an Iban pronunciation lexicon from a Malay one. A hybrid version, mix of Malay and Iban pronunciations, was also built and evaluated. Following this, we experimented with three Iban ASRs; each depended on either one of the three different pronunciation dictionaries: Malay, Iban or hybrid. Our best results (WER) for Iban ASR (with different lexicon) were as follows: 20.82% (Malay G2P), 21.90% (Iban G2P) and 20.60% (Hybrid G2P). Apart from that, we applied system combination using all of the systems and obtained an improved accuracy of 19.22%.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Fast Bootstrapping of Grapheme to Phoneme System for Under-resourced Languages - Application to the Iban Language

This paper deals with the fast bootstrapping of Grapheme-to-Phoneme (G2P) conversion system, which is a key module for both automatic speech recognition (ASR), and text-to-speech synthesis (TTS). The idea is to exploit language contact between a local dominant language (Malay) and a very under-resourced language (Iban spoken in Sarawak and in several parts of the Borneo Island) for which no res...

متن کامل

Sixth International Joint Conference on Natural Language Processing Proceedings of the Fourth Workshop on South and Southeast Asian Natural Language Processing

This paper deals with the fast bootstrapping of Grapheme-to-Phoneme (G2P) conversion system, which is a key module for both automatic speech recognition (ASR), and text-to-speech synthesis (TTS). The idea is to exploit language contact between a local dominant language (Malay) and a very under-resourced language (Iban spoken in Sarawak and in several parts of the Borneo Island) for which no res...

متن کامل

Using resources from a closely-related language to develop ASR for a very under-resourced language: a case study for iban

This paper presents our strategies for developing an automatic speech recognition system for Iban, an under-resourced language. We faced several challenges such as no pronunciation dictionary and lack of training material for building acoustic models. To overcome these problems, we proposed approaches which exploit resources from a closely-related language (Malay). We developed a semi-supervise...

متن کامل

SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian

This study investigates the possibility of using statistical machine translation to create domainspecific language resources. We propose a methodology that aims to create a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-su...

متن کامل

Analysis of Mismatched Transcriptions Generated by Humans and Machines for Under-Resourced Languages

When speech data with native transcriptions are scarce in an under-resourced language, automatic speech recognition (ASR) must be trained using other methods. Semi-supervised learning first labels the speech using ASR from other languages, then re-trains the ASR using the generated labels. Mismatched crowdsourcing asks crowd-workers unfamiliar with the language to transcribe it. In this paper, ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014